The concepts outlined by the SQL code below are taken from this LinkedIn learning course, “SQL for Data Analysis” - https://www.linkedin.com/learning/sql-for-data-analysis
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
select * from book
| book_id | title | isbn13 | language_id | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 1 | The World’s First Love: Mary Mother of God | 8987059752 | 2 | 276 | 1996-09-01 | 1010 |
| 2 | The Illuminati | 20049130001 | 1 | 352 | 2004-10-04 | 1967 |
| 3 | The Servant Leader | 23755004321 | 1 | 128 | 2003-03-11 | 1967 |
| 4 | What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 | 34406054602 | 1 | 168 | 1999-09-01 | 1978 |
| 5 | Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | 49086007763 | 1 | 80 | 1983-12-29 | 416 |
| 6 | Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | 73999140774 | 1 | 298 | 2000-04-01 | 96 |
| 7 | William Goldman: Four Screenplays | 73999254907 | 2 | 504 | 2000-05-01 | 95 |
| 8 | The Season: A Candid Look at Broadway | 73999768442 | 1 | 448 | 2004-07-01 | 1136 |
| 9 | The Beatles Complete - Updated Edition | 73999960822 | 1 | 303 | 1986-12-01 | 835 |
| 10 | Working Effectively with Legacy Code | 76092025986 | 1 | 464 | 2004-09-01 | 1591 |
select * from book_language
| language_id | language_code | language_name |
|---|---|---|
| 1 | eng | English |
| 2 | en-US | United States English |
| 3 | fre | French |
| 4 | spa | Spanish |
| 5 | en-GB | British English |
| 6 | mul | Multiple Languages |
| 7 | grc | Greek |
| 8 | enm | Middle English |
| 9 | en-CA | Canadian English |
| 10 | ger | German |
Key date and time functions in SQLite -
SELECT DATE('now') AS "Current Date"; -- returns the current date
| Current Date |
|---|
| 2024-07-05 |
SELECT DATETIME('now') AS "Current Date and Time"; -- returns the current date and time
| Current Date and Time |
|---|
| 2024-07-05 08:21:00 |
SELECT STRFTIME('%Y', 'now') AS "Current Year"; -- returns the current year
| Current Year |
|---|
| 2024 |
SELECT TIME('now') as "Current Time"; -- returns the current time
| Current Time |
|---|
| 08:21:00 |
SELECT CURRENT_DATE as "Current Date"; -- returns the current date
| Current Date |
|---|
| 2024-07-05 |
SELECT CURRENT_TIME as "Current Time"; -- returns the current time
| Current Time |
|---|
| 08:21:00 |
SELECT CURRENT_TIMESTAMP as "Current Time and Date"; -- returns the current date and time
| Current Time and Date |
|---|
| 2024-07-05 08:21:00 |
SELECT status_date as "Status Date" from order_history;
| Status Date |
|---|
| 2023-07-19 15:20:03 |
| 2023-12-12 20:06:08 |
| 2023-12-08 20:19:19 |
| 2023-07-12 18:38:30 |
| 2024-04-15 04:33:17 |
| 2023-12-19 09:43:57 |
| 2024-01-06 22:39:56 |
| 2023-07-25 00:39:09 |
| 2023-09-29 13:29:48 |
| 2023-11-14 17:13:49 |
SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date BETWEEN '2023-11-01'
and '2024-05-01' ORDER BY order_date DESC;
| Customer ID | Order Date |
|---|---|
| 1240 | 2024-04-30 23:45:50 |
| 241 | 2024-04-30 23:18:12 |
| 155 | 2024-04-30 21:43:09 |
| 59 | 2024-04-30 19:32:57 |
| 79 | 2024-04-30 19:09:28 |
| 109 | 2024-04-30 19:01:49 |
| 1367 | 2024-04-30 18:50:07 |
| 21 | 2024-04-30 17:35:08 |
| 45 | 2024-04-30 17:34:56 |
| 383 | 2024-04-30 16:41:12 |
SELECT customer_id as “Customer ID”, order_date as “Order Date” from cust_order WHERE order_date < Now();
The above SQL code isn’t possible as there isn’t a Now() function in SQLite.
SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date < CURRENT_TIMESTAMP;
| Customer ID | Order Date |
|---|---|
| 2 | 2023-07-19 11:32:03 |
| 2 | 2023-12-12 14:06:08 |
| 3 | 2023-12-08 19:55:19 |
| 4 | 2023-07-12 15:26:30 |
| 4 | 2024-04-14 22:57:17 |
| 4 | 2023-12-19 07:43:57 |
| 5 | 2024-01-06 14:03:56 |
| 6 | 2023-07-24 13:27:09 |
| 6 | 2023-09-29 02:29:48 |
| 7 | 2023-11-14 16:13:49 |
SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date > CURRENT_TIMESTAMP;
| Customer ID | Order Date |
|---|
SQLite doesn’t have a YEAR() function but it does have a TIME() function to use.
SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE TIME(order_date) > TIME(CURRENT_TIMESTAMP);
| Customer ID | Order Date |
|---|---|
| 2 | 2023-07-19 11:32:03 |
| 2 | 2023-12-12 14:06:08 |
| 3 | 2023-12-08 19:55:19 |
| 4 | 2023-07-12 15:26:30 |
| 4 | 2024-04-14 22:57:17 |
| 5 | 2024-01-06 14:03:56 |
| 6 | 2023-07-24 13:27:09 |
| 7 | 2023-11-14 16:13:49 |
| 7 | 2023-12-23 11:15:16 |
| 8 | 2023-07-04 10:31:55 |
The function SUBSTRING() returns a part of a character string SUBSTRING(string, start, length)
select title as "Book Title" from book WHERE SUBSTRING(Title,1,2) = 'Ro'
| Book Title |
|---|
| Roverandom |
| Rough Crossings: Britain the Slaves and the American Revolution |
| Romanticism |
| Ronia the Robber’s Daughter |
| Robinson Crusoe |
| Rock Star Superstar |
| Rounding the Mark (Inspector Montalbano #7) |
| Roald Dahl: A Biography |
| Rose of No Man’s Land |
| Robinson Crusoe |
The CONCAT(string1, string2,….,string_n) function joins two or more strings together
select CONCAT(first_name,' ',last_name) AS "Full Name" from customer
| Full Name |
|---|
| Ursola Purdy |
| Ruthanne Vatini |
| Reidar Turbitt |
| Rich Kirsz |
| Carline Kupis |
| Kandy Adamec |
| Jermain Giraudeau |
| Nolly Bonicelli |
| Phebe Curdell |
| Euell Guilder |
Double pipes, || , can also be used in place of CONCAT(), in SQLite.
SELECT first_name || ' ' || last_name AS "Full Name"
FROM customer;
| Full Name |
|---|
| Ursola Purdy |
| Ruthanne Vatini |
| Reidar Turbitt |
| Rich Kirsz |
| Carline Kupis |
| Kandy Adamec |
| Jermain Giraudeau |
| Nolly Bonicelli |
| Phebe Curdell |
| Euell Guilder |
The UPPER() and LOWER() functions will return strings as upper case or lower case, respectively.
select UPPER(CONCAT(first_name,' ',last_name)) AS "Full Name" from customer
| Full Name |
|---|
| URSOLA PURDY |
| RUTHANNE VATINI |
| REIDAR TURBITT |
| RICH KIRSZ |
| CARLINE KUPIS |
| KANDY ADAMEC |
| JERMAIN GIRAUDEAU |
| NOLLY BONICELLI |
| PHEBE CURDELL |
| EUELL GUILDER |
select LOWER(CONCAT(first_name,' ',last_name)) AS "Full Name" from customer
| Full Name |
|---|
| ursola purdy |
| ruthanne vatini |
| reidar turbitt |
| rich kirsz |
| carline kupis |
| kandy adamec |
| jermain giraudeau |
| nolly bonicelli |
| phebe curdell |
| euell guilder |
The REPLACE() function replaces a substring with another substring - REPLACE(string, old_string, new_string)
select REPLACE(language_id, '1', "English") as Language from book where language_id = '1'
| Language |
|---|
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
| English.0 |
select title as "Title" FROM book
| Title |
|---|
| The World’s First Love: Mary Mother of God |
| The Illuminati |
| The Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplays |
| The Season: A Candid Look at Broadway |
| The Beatles Complete - Updated Edition |
| Working Effectively with Legacy Code |
TRIM: Removes leading and trailing characters (default is space).
select TRIM(title) as "Trimmed Title" FROM book
| Trimmed Title |
|---|
| The World’s First Love: Mary Mother of God |
| The Illuminati |
| The Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplays |
| The Season: A Candid Look at Broadway |
| The Beatles Complete - Updated Edition |
| Working Effectively with Legacy Code |
TRIM with specific characters: Removes specified characters from both ends.
SELECT TRIM(title, 'The') AS "Trimmed Title" FROM book;
| Trimmed Title |
|---|
| World’s First Love: Mary Mother of God |
| Illuminati |
| Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplays |
| Season: A Candid Look at Broadway |
| Beatles Complete - Updated Edition |
| Working Effectively with Legacy Cod |
LTRIM with specific characters: Removes specified leading characters.
SELECT LTRIM(title, 'What') AS "Trimmed Title" FROM book;
| Trimmed Title |
|---|
| The World’s First Love: Mary Mother of God |
| The Illuminati |
| The Servant Leader |
| Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| illiam Goldman: Four Screenplays |
| The Season: A Candid Look at Broadway |
| The Beatles Complete - Updated Edition |
| orking Effectively with Legacy Code |
SELECT LTRIM(title, 'The') AS "Trimmed Title" FROM book;
| Trimmed Title |
|---|
| World’s First Love: Mary Mother of God |
| Illuminati |
| Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplays |
| Season: A Candid Look at Broadway |
| Beatles Complete - Updated Edition |
| Working Effectively with Legacy Code |
RTRIM with specific characters: Removes specified trailing characters.
SELECT RTRIM(title, 's') AS "Trimmed Title" FROM book;
| Trimmed Title |
|---|
| The World’s First Love: Mary Mother of God |
| The Illuminati |
| The Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frog |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplay |
| The Season: A Candid Look at Broadway |
| The Beatles Complete - Updated Edition |
| Working Effectively with Legacy Code |
LTRIM and RTRIM can be used in combination with one another -
SELECT RTRIM(LTRIM(title, 'The'), 's') AS "Trimmed Title" FROM book;
| Trimmed Title |
|---|
| World’s First Love: Mary Mother of God |
| Illuminati |
| Servant Leader |
| What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frog |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II |
| William Goldman: Four Screenplay |
| Season: A Candid Look at Broadway |
| Beatles Complete - Updated Edition |
| Working Effectively with Legacy Code |
select UPPER(CONCAT(street_number,' ',street_name,' ',city)) as "Address" from address LIMIT 6
| Address |
|---|
| 57.0 GLACIER HILL AVENUE TORBAT-E JāM |
| 86.0 DOTTIE JUNCTION BEAUMONT |
| 292.0 RAMSEY AVENUE CAYAMBE |
| 5618.0 THACKERAY JUNCTION CALDAS |
| 4.0 2ND PARK NGUNGURU |
| 387.0 NANCY JUNCTION BURIRAO |
select * from Publisher
| publisher_id | publisher_name |
|---|---|
| 1 | 10/18 |
| 2 | 1st Book Library |
| 3 | 1st World Library |
| 4 | A & C Black (Childrens books) |
| 5 | A Harvest Book/Harcourt Inc. |
| 6 | A K PETERS |
| 7 | AA World Services |
| 8 | Abacus |
| 9 | Abacus Books |
| 10 | Abacus Books (London) |
INSERT INTO Publisher (publisher_id, publisher_name) VALUES ('1001', 'UoG Press')
select * from Publisher WHERE publisher_name == "UoG Press"
| publisher_id | publisher_name |
|---|---|
| 1001 | UoG Press |
UPDATE Publisher SET publisher_id = '1002' WHERE publisher_name == "UoG Press"
select * from Publisher WHERE publisher_name == "UoG Press"
| publisher_id | publisher_name |
|---|---|
| 1002 | UoG Press |
DELETE FROM Publisher WHERE publisher_id = '1002'
select * from Publisher WHERE publisher_name == "UoG Press"
| publisher_id | publisher_name |
|---|
import pandas as pd
import sqlalchemy as sa
import plotly.express as px
import plotly.graph_objects as go
from sqlalchemy.engine import create_engine
#
# def run_query(query):
# return pd.read_sql(query, con=connection)
#
# df = run_query("SELECT * FROM cust_order",con=connection )
#creating a new column to mark each order as a 1#
cust_order <- cust_order %>%
mutate(number_orders = 1)
#converting the order date column into a date format#
cust_order$order_date <- as.Date(cust_order$order_date)
cust_order$order_date <- ymd(cust_order$order_date)
#simply bar chart of the no.orders, put into an object to be run through plotly#
p <- ggplot(cust_order, aes(x = order_date, y = number_orders)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = " Simple Bar Chart", x = "Order Date", y = "No.Orders") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#display the bar chart#
p
We can use the ggplotly from the plotly package to make the plot interactive.
p <- ggplotly(p)
p
cust_order_monthly <- cust_order %>%
select(order_date, number_orders) %>%
group_by(month = lubridate::floor_date(`order_date`, 'month')) %>%
summarise("no_orders_monthly" = sum(number_orders))
p <- ggplot(cust_order_monthly , aes(x = month, y = no_orders_monthly)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Monthly orders from the 'Cust_order' Table", x = "Order Date", y = "No.Orders") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Making the plot interactive -
p <- ggplotly(p)
p
https://rstudio.github.io/dygraphs/ - An extremely useful way of visualising longitudinal data in an interactive, highly-customisable manner.
don <- xts(x = cust_order_monthly$no_orders_monthly, order.by = cust_order_monthly$month)
p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Monthly No.Orders from 'cust_order' Table ") %>%
dySeries("V1", label = "No.Orders") %>%
dyRangeSelector() %>%
dyUnzoom()
p
#Create 'Unzoom' function#
dyUnzoom <-function(dygraph) {
dyPlugin(
dygraph = dygraph,
name = "Unzoom",
path = system.file("plugins/unzoom.js", package = "dygraphs")
)
}
cust_order_daily <- cust_order %>%
select(order_date, number_orders) %>%
group_by(day = lubridate::floor_date(`order_date`, 'day')) %>%
summarise("no_orders_daily" = sum(number_orders))
cust_order_daily$day <- as.POSIXct(cust_order_daily$day)
don <- xts(x = cust_order_daily$no_orders_daily, order.by = cust_order_daily$day)
p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
dySeries("V1", label = "No.Orders") %>%
dyRangeSelector() %>%
dyUnzoom()
p
The time series can be ‘filled’ using the dyoptions() argument ‘fillgraph’ (TRUE/FALSE).
p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
dySeries("V1", label = "No.Orders") %>%
dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
dyRangeSelector() %>%
dyUnzoom()
p
The function dyEvent() allows vertical lines with a label to be added. dyAnnotation() generates annotations directly on the time series line. Using a custom function, a larger annotation and attached to the x-axis is also possible, presAnnotation() . dyshading() allows whole areas of the graph to be shaded -
presAnnotation <- function(dygraph, x, text) {
dygraph %>%
dyAnnotation(x, text, attachAtBottom = TRUE, width = 60)
}
p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
dySeries("V1", label = "No.Orders") %>%
dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
dyRangeSelector() %>%
dyEvent("2023-08-24", "Andy's Order", labelLoc = "bottom") %>%
dyEvent("2023-10-06", "Bob's Order", labelLoc = "bottom") %>%
dyAnnotation("2023-09-13", text = "Sep.", tooltip = "September 241 Offer") %>%
dyAnnotation("2023-11-13", text = "Nov.",tooltip = "November Multibuy Offer" ) %>%
presAnnotation("2024-05-29", text = "BHol" ) %>%
dyShading(from = "2024-04-01", to = "2024-05-01", color = "#CCEBD6") %>%
dyUnzoom()
p
The function dyLimit() allows line limits to be added.
p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
dySeries("V1", label = "No.Orders") %>%
dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
dyRangeSelector() %>%
dyLimit(8, color = "blue") %>%
dyLimit(40, color = "blue") %>%
dyUnzoom()
p
This section and the following take inspiration but not much content, from the LinkedIn learning course “Intermediate SQL for Data Scientists” - https://www.linkedin.com/learning/intermediate-sql-for-data-scientists
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
Indexes in SQLite are structures designed to improve the speed of data retrieval. They’re similar to indexes in a book, allowing for faster access to rows in a table based on the values of one or more columns.
Types of Indexes -
Single-Column Index - Created on a single column.
Multi-Column Index - Created on two or more columns.
Unique Index - Ensures that all values in the indexed column(s) are unique.
Primary Key Index - Automatically created when a table has a primary key constraint.
Automatic Indexes - Sometimes SQLite creates indexes automatically to optimize query performance, particularly for joins.
CREATE INDEX Customer_IDs ON cust_order(customer_id);
CREATE INDEX Customer_Orders ON cust_order(order_id, customer_id);
CREATE UNIQUE INDEX unique_countries ON country(country_name);
This SQL code above fails - A unique index does not discard non-unique values or automatically filter out duplicates. Instead, it enforces a constraint that prevents duplicates from being inserted into the table in the first place.
CREATE UNIQUE INDEX unique_status_ids ON order_status(status_id);
List all indexes associated with a table -
PRAGMA index_list(cust_order);
| seq | name | unique | origin | partial |
|---|---|---|---|---|
| 0 | Customer_Orders | 0 | c | 0 |
| 1 | Customer_IDs | 0 | c | 0 |
Get information about a specific index -
PRAGMA index_info(Customer_Orders);
| seqno | cid | name |
|---|---|---|
| 0 | 0 | order_id |
| 1 | 2 | customer_id |
Indexes can be dropped using the DROP INDEX statement -
DROP INDEX Customer_Orders;
Checking the index deletion has worked -
PRAGMA index_list(cust_order);
| seq | name | unique | origin | partial |
|---|---|---|---|---|
| 0 | Customer_IDs | 0 | c | 0 |
An example with a more complex and intricate query -
select * from book
| book_id | title | isbn13 | language_id | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 1 | The World’s First Love: Mary Mother of God | 8987059752 | 2 | 276 | 1996-09-01 | 1010 |
| 2 | The Illuminati | 20049130001 | 1 | 352 | 2004-10-04 | 1967 |
| 3 | The Servant Leader | 23755004321 | 1 | 128 | 2003-03-11 | 1967 |
| 4 | What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 | 34406054602 | 1 | 168 | 1999-09-01 | 1978 |
| 5 | Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | 49086007763 | 1 | 80 | 1983-12-29 | 416 |
| 6 | Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | 73999140774 | 1 | 298 | 2000-04-01 | 96 |
| 7 | William Goldman: Four Screenplays | 73999254907 | 2 | 504 | 2000-05-01 | 95 |
| 8 | The Season: A Candid Look at Broadway | 73999768442 | 1 | 448 | 2004-07-01 | 1136 |
| 9 | The Beatles Complete - Updated Edition | 73999960822 | 1 | 303 | 1986-12-01 | 835 |
| 10 | Working Effectively with Legacy Code | 76092025986 | 1 | 464 | 2004-09-01 | 1591 |
select author_name from author where author_name == "Walter Scott"
| author_name |
|---|
| Walter Scott |
Creating an index of author names.
CREATE INDEX Author_names ON author(author_name);
Now our ‘sever’ should use that index to optimise the performance of this query.
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling', 'Bill Bryson')
AND strftime('%Y', b.publication_date) IN ('1997','1998', '1999','2000')
ORDER BY ol.price DESC;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| Harry Potter and the Goblet of Fire (Harry Potter #4) | 9.780748e+12 | 636 | 2000-07-08 | J.K. Rowling | 15.04 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 13.73 |
| The Lost Continent: Travels in Small-town America | 9.780553e+12 | 379 | 1999-01-02 | Bill Bryson | 12.07 |
| Harry Potter and the Sorcerer’s Stone (Harry Potter #1) | 9.780786e+12 | 424 | 1999-11-12 | J.K. Rowling | 9.99 |
| The Lost Continent: Travels in Small-town America | 9.780553e+12 | 379 | 1999-01-02 | Bill Bryson | 9.60 |
| I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away | 9.780768e+12 | 304 | 2000-06-28 | Bill Bryson | 9.40 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 2.71 |
| Notes from a Small Island | 9.780381e+12 | 324 | 1997-05-28 | Bill Bryson | 1.87 |
Views in SQLite are virtual tables that provide a way to represent the results of a query as a table.
Reasons to use Views -
Simplify Complex Queries - By encapsulating complex joins and calculations within a view, queries can be simpler and more understandable.
Enhance Security - Views can restrict access to specific data by exposing only certain columns or rows to users.
Provide Abstraction - Offer a layer of abstraction, allowing changes in the underlying database schema without affecting the end users.
Improve Maintainability - Views centralise query logic, making the system easier to maintain and modify.
CREATE VIEW Bryson_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Bill Bryson');
CREATE VIEW Rowling_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling');
CREATE VIEW Walter_Scott_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Walter Scott');
SELECT * FROM Bryson_Books;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| Bizarre World | 9.780752e+12 | 120 | 2001-05-01 | Bill Bryson | 17.32 |
| The Lost Continent: Travels in Small Town America | 9.780061e+12 | 299 | 1990-08-28 | Bill Bryson | 14.36 |
| Neither Here nor There: Travels in Europe | 9.780381e+12 | 254 | 1993-03-28 | Bill Bryson | 12.47 |
| Made in America: An Informal History of the English Language in the United States | 9.780381e+12 | 364 | 2001-10-23 | Bill Bryson | 9.28 |
| Notes from a Small Island | 9.780381e+12 | 324 | 1997-05-28 | Bill Bryson | 1.87 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 2.71 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 13.73 |
| A Short History of Nearly Everything (Illustrated Edition) | 9.780386e+12 | 624 | 2010-10-05 | Bill Bryson | 5.64 |
| A Short History of Nearly Everything (Illustrated Edition) | 9.780386e+12 | 624 | 2010-10-05 | Bill Bryson | 19.05 |
| Down Under | 9.780553e+12 | 398 | 2001-08-06 | Bill Bryson | 10.92 |
SELECT * FROM Walter_Scott_Books;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 10.66 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 15.75 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 0.84 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 13.99 |
| The Castle of Otranto | 9.780193e+12 | 125 | 1998-07-16 | Walter Scott | 3.56 |
| Waverley | 9.780193e+12 | 463 | 1998-08-20 | Walter Scott | 1.96 |
Views themselves are not directly updatable, but they can be dropped and recreated -
DROP VIEW IF EXISTS Bryson_Books;
SELECT name FROM sqlite_master WHERE type='view';
| name |
|---|
| Rowling_Books |
| Walter_Scott_Books |
SELECT sql FROM sqlite_master WHERE type='view' AND name='Walter_Scott_Books';
| sql |
|---|
| CREATE VIEW Walter_Scott_Books AS |
SELECT b.title, b.isbn13, b.num_pages, b.publication_date, a.author_name, ol.price FROM book b INNER JOIN book_author ba ON b.book_id = ba.book_id INNER JOIN author a ON ba.author_id = a.author_id INNER JOIN order_line ol ON b.book_id = ol.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id WHERE a.author_name IN (‘Walter Scott’) |
The SUM() function is an aggregate function that calculates the total sum of a numeric column. The function is commonly used in conjunction with the GROUP BY clause to calculate sums for specific groups of data.
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SELECT SUM(street_name) as "No.different Street Names", SUM(city) as "No.different Cities" FROM address;
| No.different Street Names | No.different Cities |
|---|---|
| 87 | 0 |
As street_name and city are text columns, sum won’t work properly on them. It’s best to use COUNT(DISTINCT()).
SELECT COUNT(DISTINCT street_name) as "No. of Different Street Names",
COUNT(DISTINCT city) as "No. of Different Cities"
FROM address;
| No. of Different Street Names | No. of Different Cities |
|---|---|
| 958 | 974 |
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 112.44 | Abbot Tesseyman |
| 31.84 | Abbott Kendrew |
| 119.68 | Abby Chevins |
| 59.10 | Abel Trower |
| 88.54 | Abigael Trowbridge |
| 177.89 | Abraham Skudder |
| 104.84 | Adah Cotty |
| 143.15 | Addie Basterfield |
| 47.75 | Addison Sigg |
| 161.51 | Adrian Kunzelmann |
The AVG() function in SQL is an aggregate function that calculates the average value of a numeric column. It sums up all the values in the column and divides by the number of non-null values, providing the mean value.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name |
|---|---|---|
| 112.44 | 11.244000 | Abbot Tesseyman |
| 31.84 | 10.613333 | Abbott Kendrew |
| 119.68 | 13.297778 | Abby Chevins |
| 59.10 | 7.387500 | Abel Trower |
| 88.54 | 8.854000 | Abigael Trowbridge |
| 177.89 | 9.362632 | Abraham Skudder |
| 104.84 | 7.488571 | Adah Cotty |
| 143.15 | 10.225000 | Addie Basterfield |
| 47.75 | 9.550000 | Addison Sigg |
| 161.51 | 9.500588 | Adrian Kunzelmann |
The ROUND() function in SQL is used to round a numeric value to a specified number of decimal places. It takes two arguments: the number to be rounded and the number of decimal places to round to.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
ROUND(AVG(ol.price),2) AS "Average Order Price per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name |
|---|---|---|
| 112.44 | 11.24 | Abbot Tesseyman |
| 31.84 | 10.61 | Abbott Kendrew |
| 119.68 | 13.30 | Abby Chevins |
| 59.10 | 7.39 | Abel Trower |
| 88.54 | 8.85 | Abigael Trowbridge |
| 177.89 | 9.36 | Abraham Skudder |
| 104.84 | 7.49 | Adah Cotty |
| 143.15 | 10.22 | Addie Basterfield |
| 47.75 | 9.55 | Addison Sigg |
| 161.51 | 9.50 | Adrian Kunzelmann |
The VARIANCE() function is used to calculate the statistical variance of a set of numeric values, which measures the dispersion of the values from their mean. However, SQLite does not have a built-in VARIANCE() function.
To calculate variance in SQLite, one can use a combination of SQL functions.
The sample variance estimates the variance from a sample of the population. The formula is -
\[ \sigma\^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
(\[\sigma\^2\]) is the population variance
(\[x_i\]) represents each data point
(\[\bar{x}\]) is the sample mean
(\[n - 1\]) is the total number of data points in the population
Used when there’s only a sample and need the population variance needs to be estimated. The denominator is \[n - 1\] . \[n - 1\] instead of \[n\] to correct the bias in the estimation of the population variance from a sample (Bessel’s correction).
The SQL code -
SELECT SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1) AS sample_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Sample Variance |
|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.82 |
| 663.41 | 10.21 | Farand Tremmil | 34.46 |
| 635.58 | 10.42 | Renado Sherrington | 34.50 |
| 622.31 | 12.20 | Lynda Roseborough | 26.15 |
| 611.59 | 10.54 | Mick Sever | 38.09 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.67 |
| 583.81 | 9.73 | Emylee Hubbert | 37.33 |
| 574.20 | 10.07 | La verne Figg | 30.94 |
| 561.26 | 9.85 | Zora Hurles | 42.98 |
| 549.63 | 10.78 | Penny Bovingdon | 37.10 |
The population variance measures the dispersion of all data points in a population from the population mean ((\[\mu\])). The formula is -
\[ \sigma\^2 = \frac{\sum (x_i - \mu)^2}{N} \]
(\[\sigma\^2\]) is the population variance
(\[x_i\]) represents each data point
(\[\mu\]) is the mean of the population
(\[n\]) is the total number of data points in the population
With large datasets, the difference between sample and population variance is minimal since the correction factor (n - 1) vs. (n) has a smaller impact as the sample size increases.
The SQL code -
SELECT AVG((value - avg_value) * (value - avg_value)) AS population_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Population Variance |
|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.31 |
| 663.41 | 10.21 | Farand Tremmil | 33.93 |
| 635.58 | 10.42 | Renado Sherrington | 33.94 |
| 622.31 | 12.20 | Lynda Roseborough | 25.64 |
| 611.59 | 10.54 | Mick Sever | 37.44 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.11 |
| 583.81 | 9.73 | Emylee Hubbert | 36.71 |
| 574.20 | 10.07 | La verne Figg | 30.40 |
| 561.26 | 9.85 | Zora Hurles | 42.23 |
| 549.63 | 10.78 | Penny Bovingdon | 36.37 |
Standard deviation measures the dispersion of a dataset relative to its mean, indicating how spread out the data points are. A low standard deviation means the data points are close to the mean, while a high standard deviation indicates they are more spread out.
SQLite does not have a built-in STDDEV() or STDEV() function for calculating standard deviation.
These are the steps to calculate standard deviation in SQLite:
Calculate the mean of the dataset.
Compute Squared Differences calculate the squared difference of each value from the mean.
Aggregate and Calculate - Sum up the squared differences, divide by the count of values (for population standard deviation) or by (n - 1) (for sample standard deviation), and take the square root of the result.
The population standard deviation (()) is calculated using the formula:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
where: - () is the population standard deviation, - (x_i) represents each data point, - () is the mean of the population, - (N) is the total number of data points in the population.
The SQL code -
SELECT SQRT(AVG((value - avg_value) * (value - avg_value))) AS population_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance",
ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count"), 2) AS "Order Price Standard Deviation"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Population Variance | Order Price Standard Deviation |
|---|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.31 | 5.68 |
| 663.41 | 10.21 | Farand Tremmil | 33.93 | 5.82 |
| 635.58 | 10.42 | Renado Sherrington | 33.94 | 5.83 |
| 622.31 | 12.20 | Lynda Roseborough | 25.64 | 5.06 |
| 611.59 | 10.54 | Mick Sever | 37.44 | 6.12 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.11 | 5.39 |
| 583.81 | 9.73 | Emylee Hubbert | 36.71 | 6.06 |
| 574.20 | 10.07 | La verne Figg | 30.40 | 5.51 |
| 561.26 | 9.85 | Zora Hurles | 42.23 | 6.50 |
| 549.63 | 10.78 | Penny Bovingdon | 36.37 | 6.03 |
The sample standard deviation ((s)) is calculated using the formula:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
where: - (s) is the sample standard deviation, - (x_i) represents each data point in the sample, - ({x}) is the sample mean, - (n) is the number of data points in the sample.
The SQL code -
SELECT SQRT(SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1)) AS sample_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;
WITH SampledCustomers AS (
SELECT
c.customer_id,
c.first_name,
c.last_name
FROM
customer c
ORDER BY
RANDOM()
LIMIT 50
),
OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
SampledCustomers c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance",
ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1)), 2) AS "Order Price Sample Standard Deviation"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Sample Variance | Order Price Sample Standard Deviation |
|---|---|---|---|---|
| 516.90 | 11.24 | Enrichetta Morrill | 35.84 | 5.99 |
| 383.98 | 10.67 | Chrissy Notton | 32.56 | 5.71 |
| 337.77 | 10.56 | Lianne Gorry | 32.20 | 5.67 |
| 316.29 | 9.88 | Bethina Cady | 30.98 | 5.57 |
| 302.11 | 9.15 | Lucas Wyldbore | 39.15 | 6.26 |
| 288.05 | 8.73 | Rustin Cadden | 39.63 | 6.30 |
| 259.36 | 8.94 | Rodd Diplock | 23.67 | 4.87 |
| 254.38 | 11.56 | Morris Morales | 42.79 | 6.54 |
| 180.90 | 8.61 | Fania Jeanesson | 27.50 | 5.24 |
| 170.70 | 10.67 | Danice Matthiesen | 37.23 | 6.10 |
WITH(), also known as Common Table Expressions (CTEs), allows for improved readability and reusability of SQL queries. It’s particularly useful for breaking down complex queries into simpler, more manageable parts by creating temporary result sets that can be referenced within the main query.
WITH cte_name AS ( – CTE Query SELECT … ) SELECT … FROM cte_name
Key Differences
Subqueries - Can be less readable, especially when nested.
WITH() - Provide a clearer, more structured approach by defining temporary tables with meaningful names.
Subqueries - Generally not reusable; you need to repeat the subquery if it’s used in multiple places.
WITH() - Reusable within the main query, reducing redundancy and improving maintainability.
Subqueries - Harder to debug and maintain due to their nested nature.
WITH() - Easier to debug and maintain due to their clear, modular structure.
When using WITH()
WITH CustomerOrders AS (
SELECT
co.customer_id,
co.order_date,
ol.order_id,
SUM(ol.price) AS "Total Order Amount"
FROM
cust_order co
INNER JOIN
order_line ol ON co.order_id = ol.order_id
WHERE
co.order_date BETWEEN '2023-11-01' AND '2024-05-01'
GROUP BY
co.customer_id, co.order_date, ol.order_id
)
SELECT
c.customer_id AS "Customer ID",
c.order_date AS "Order Date",
COUNT(c.order_id) AS "Number of Orders",
SUM(c."Total Order Amount") AS "Total Amount Spent",
AVG(c."Total Order Amount") AS "Average Order Amount"
FROM
CustomerOrders c
GROUP BY
c.customer_id
ORDER BY
c.order_date DESC;
| Customer ID | Order Date | Number of Orders | Total Amount Spent | Average Order Amount |
|---|---|---|---|---|
| 383 | 2024-04-30 16:41:12 | 1 | 28.12 | 28.12 |
| 1013 | 2024-04-29 01:27:43 | 1 | 12.34 | 12.34 |
| 1436 | 2024-04-27 15:36:45 | 1 | 15.86 | 15.86 |
| 933 | 2024-04-27 00:04:23 | 1 | 15.67 | 15.67 |
| 975 | 2024-04-26 09:37:25 | 1 | 17.40 | 17.40 |
| 557 | 2024-04-24 21:49:03 | 1 | 1.16 | 1.16 |
| 1002 | 2024-04-24 18:23:20 | 1 | 26.49 | 26.49 |
| 1103 | 2024-04-23 05:28:36 | 1 | 8.94 | 8.94 |
| 1157 | 2024-04-23 02:15:30 | 1 | 52.09 | 52.09 |
| 110 | 2024-04-22 16:05:31 | 1 | 30.88 | 30.88 |
The part with the WITH inner query -
This CTE calculates the total amount spent on each order for every customer within the specified date range.
It joins the cust_order and order_line tables and groups the data by customer ID, order date, and order ID to compute the total order amount.
The rest, the outer query -
The main query selects from the CustomerOrders CTE.
It calculates the number of orders, total amount spent, and average order amount for each customer.
The results are grouped by customer ID and ordered by the order date in descending order.
When not using WITH()
SELECT
co.customer_id AS "Customer ID",
co.order_date AS "Order Date",
COUNT(orders_per_customer.order_id) AS "Number of Orders",
SUM(orders_per_customer."Total Order Amount") AS "Total Amount Spent",
AVG(orders_per_customer."Total Order Amount") AS "Average Order Amount"
FROM
cust_order co
INNER JOIN (
SELECT
co_inner.customer_id,
co_inner.order_date,
ol_inner.order_id,
SUM(ol_inner.price) AS "Total Order Amount"
FROM
cust_order co_inner
INNER JOIN
order_line ol_inner ON co_inner.order_id = ol_inner.order_id
WHERE
co_inner.order_date BETWEEN '2023-11-01' AND '2024-05-01'
GROUP BY
co_inner.customer_id, co_inner.order_date, ol_inner.order_id
) AS orders_per_customer ON co.customer_id = orders_per_customer.customer_id AND co.order_date = orders_per_customer.order_date
WHERE
co.order_date BETWEEN '2023-11-01' AND '2024-05-01'
GROUP BY
co.customer_id, co.order_date
ORDER BY
co.order_date DESC;
| Customer ID | Order Date | Number of Orders | Total Amount Spent | Average Order Amount |
|---|---|---|---|---|
| 1240 | 2024-04-30 23:45:50 | 1 | 38.34 | 38.34 |
| 241 | 2024-04-30 23:18:12 | 1 | 17.15 | 17.15 |
| 155 | 2024-04-30 21:43:09 | 1 | 50.45 | 50.45 |
| 59 | 2024-04-30 19:32:57 | 1 | 16.54 | 16.54 |
| 79 | 2024-04-30 19:09:28 | 1 | 3.81 | 3.81 |
| 109 | 2024-04-30 19:01:49 | 1 | 29.51 | 29.51 |
| 1367 | 2024-04-30 18:50:07 | 1 | 31.52 | 31.52 |
| 21 | 2024-04-30 17:35:08 | 1 | 9.77 | 9.77 |
| 45 | 2024-04-30 17:34:56 | 1 | 9.68 | 9.68 |
| 383 | 2024-04-30 16:41:12 | 1 | 28.12 | 28.12 |
The part with the INNER JOIN inner query -
The inner query calculates the total amount spent on each order for every customer within the specified date range.
This inner query is essentially the same as the WITH part in the previous example.
The rest, the outer query -
The outer query joins the cust_order table with the result of the inner query (orders_per_customer).
It calculates the number of orders, total amount spent, and average order amount for each customer, grouped by customer ID and order date.
The results are ordered by the order date in descending order.
The advantages to using WITH()
Readability - Using the WITH clause (CTE) makes the query more readable.In the non-CTE version, the nested query can be harder to follow.
Maintainability - The CTE version is more maintainable because each part of the query is isolated. If you need to adjust the calculation logic, it’s clearer where to make changes.
Length - The non-CTE version tends to be more long-winded, as it requires embedding the subquery directly in the FROM clause, making the overall query longer and potentially more confusing.
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
The LIKE operator is used for pattern matching within text fields. It allows the use of % to represent zero or more characters and _ to represent a single character. For example, the query SELECT * FROM customers WHERE name LIKE '%enko'; will return all rows where the name column ends with “enko”.
Getting all surnames starting with ‘Mac’ -
select * from customer where last_name like 'Mac%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 114 | Aurelie | MacSherry | amacsherry35@is.gd |
| 211 | Desiri | MacDunleavy | dmacdunleavy5u@behance.net |
| 299 | Yvonne | Maccree | ymaccree8a@mac.com |
| 416 | Zaria | MacCafferky | zmaccafferkybj@trellian.com |
| 481 | Kaleb | MacClenan | kmacclenandc@ocn.ne.jp |
| 500 | Aurelia | MacGowan | amacgowandv@opera.com |
| 604 | Agace | MacKinnon | amackinnongr@live.com |
| 951 | Wilma | MacGorrie | wmacgorrieqe@nba.com |
| 979 | Danell | Mace | dmacer6@clickbank.net |
| 994 | Archibold | MacNab | amacnabrl@nih.gov |
Getting all surnames starting with ‘Mc’ -
select * from customer where last_name like 'Mc%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 73 | Ruthi | McGeever | rmcgeever20@sakura.ne.jp |
| 168 | Lynsey | McPeice | lmcpeice4n@technorati.com |
| 187 | Miner | McLay | mmclay56@plala.or.jp |
| 199 | Bjorn | McCloud | bmccloud5i@aboutads.info |
| 317 | Stesha | McAlees | smcalees8s@wikispaces.com |
| 319 | Win | McManamon | wmcmanamon8u@ovh.net |
| 347 | Ephrem | Mc Ilwrick | emcilwrick9m@wikimedia.org |
| 410 | Rriocard | McPhail | rmcphailbd@huffingtonpost.com |
| 539 | Jane | McCreath | jmccreathey@samsung.com |
| 772 | Tybalt | McOwen | tmcowenlf@storify.com |
Getting all surnames ending with ‘vich’ -
select * from customer where last_name like '%vich'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 578 | Lynea | Matskevich | lmatskevichg1@japanpost.jp |
| 1545 | Deana | Matusevich | dmatusevichf4@multiply.com |
Getting all surnames ending with ‘enko’ -
select * from customer where last_name like '%enko'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 174 | Zachery | Hriinchenko | zhriinchenko4t@ocn.ne.jp |
| 1062 | Harold | Izacenko | hizacenko1p@pen.io |
Getting all surnames starting or ending with ‘man’ (lower or upper case) -
select * from customer where last_name like '%man%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 218 | Abbot | Tesseyman | atesseyman61@hc360.com |
| 221 | Parker | Strangman | pstrangman64@barnesandnoble.com |
| 319 | Win | McManamon | wmcmanamon8u@ovh.net |
| 360 | Dulci | Portman | dportman9z@phoca.cz |
| 366 | Giulia | Borrowman | gborrowmana5@github.com |
| 402 | Charlean | Palphramand | cpalphramandb5@amazonaws.com |
| 413 | Sigfried | Mansel | smanselbg@about.com |
| 430 | Adrian | Kunzelmann | akunzelmannbx@state.gov |
| 483 | Antonie | Liebmann | aliebmannde@blogs.com |
| 501 | Salome | Elliman | sellimandw@1688.com |
Making a new column specifically based on the like condition -
SELECT
last_name as "Customer Surname",
CASE
WHEN last_name LIKE '%enko'
OR last_name LIKE '%vich'
OR last_name LIKE '%vych'
OR last_name LIKE '%chuk'
OR last_name LIKE '%chyk'
OR last_name LIKE '%ski'
OR last_name LIKE '%sky'
OR last_name LIKE '%uk'
OR last_name LIKE '%ko'
OR last_name LIKE '%yshyn'
OR last_name LIKE '%iv' THEN 'Yes'
ELSE 'No'
END AS 'Potentially Ukrainian Surname?'
FROM customer
WHERE last_name LIKE '%enko'
OR last_name LIKE '%vich'
OR last_name LIKE '%vych'
OR last_name LIKE '%chuk'
OR last_name LIKE '%chyk'
OR last_name LIKE '%ski'
OR last_name LIKE '%sky'
OR last_name LIKE '%uk'
OR last_name LIKE '%ko'
OR last_name LIKE '%yshyn'
OR last_name LIKE '%iv';
| Customer Surname | Potentially Ukrainian Surname? |
|---|---|
| Hriinchenko | Yes |
| Davidofski | Yes |
| Bagniuk | Yes |
| Georgievski | Yes |
| Matskevich | Yes |
| Malinowski | Yes |
| Matschuk | Yes |
| Izacenko | Yes |
| Dmiterko | Yes |
| Matusevich | Yes |
Unfortunately, SQLite that’s being used here does not support the command ‘similar to’ that could be used in PostGreSQL.
If PostGreSQL was being used, the above query could be shortened a lot -
SELECT last_name, CASE WHEN last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’ THEN ‘YES’ ELSE ‘NO’ END AS “Ukrainian Surname” FROM customer WHERE last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’;
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SOUNDEX is a phonetic algorithm that indexes words by their sound when pronounced in English. This can be useful for matching words that sound similar but are spelled differently. SQLite has this function.
SELECT SOUNDEX('George');
| SOUNDEX(‘George’) |
|---|
| G620 |
SELECT last_name AS 'Surnames'
FROM customer
WHERE SOUNDEX(last_name) = SOUNDEX('Smyth');
| Surnames |
|---|
| Smoth |
| Sineath |
select soundex('Postgres'), soundex('Postgresss'), ('Postgres' = 'Postgresss'),
soundex('Postgres') = soundex('Postgresss')
| soundex(‘Postgres’) | soundex(‘Postgresss’) | (‘Postgres’ = ‘Postgresss’) | soundex(‘Postgres’) = soundex(‘Postgresss’) |
|---|---|---|---|
| P232 | P232 | 0 | 1 |
A SOUNDEX code consists of a letter followed by three digits, representing the phonetic pattern of the word. Here’s how it is constructed:
First Letter: The first letter of the word is kept as the first letter of the SOUNDEX code.
Digits: The remaining letters are converted to digits based on their phonetic sound:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
Similar Sounds: Adjacent letters that represent the same sound are collapsed into a single digit.
Vowels and Certain Letters: A, E, I, O, U, H, W, and Y are ignored unless they are the first letter.
Truncation/Zero Padding: The code is truncated to four characters if necessary, or zero-padded to ensure it is four characters long.
For example, the SOUNDEX code “P232” is generated from the word “Postgres” as follows:
‘P’ is the first letter.
‘o’ is ignored.
‘s’ maps to 2.
‘t’ maps to 3.
‘g’ maps to 2.
Remaining letters (‘r’, ‘e’, ‘s’) are either ignored or do not change the pattern as the code is already four characters long.
Thus, “Postgres” becomes “P232”.
select difference ('Postgres', 'Postgresss') as "Difference between the strings Postgres and Postgresss"
| Difference between the strings Postgres and Postgresss |
|---|
| 4 |
The DIFFERENCE() function in SQL compares the SOUNDEX values of two strings and returns an integer value between 0 and 4, indicating the degree of similarity between the two strings. A result of 4 means the strings sound very similar, while a result of 0 means they sound very different. This function is particularly useful for fuzzy matching in text searches.
The levenshtein() function calculates the Levenshtein distance between two strings, which is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It is commonly used to measure the similarity between two strings, with a lower distance indicating greater similarity.
SQLite doesn’t have it unfortunately!
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 711.41 | Rich Kirsz |
| 663.41 | Farand Tremmil |
| 635.58 | Renado Sherrington |
| 622.31 | Lynda Roseborough |
| 611.59 | Mick Sever |
| 596.18 | Alysa Crombleholme |
| 583.81 | Emylee Hubbert |
| 574.20 | La verne Figg |
| 561.26 | Zora Hurles |
| 549.63 | Penny Bovingdon |
The HAVING clause is used to specify a condition for groups of rows created by the GROUP BY clause, similar to how the WHERE clause is used to specify a condition for individual rows.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
HAVING
"Sum of Orders per Customer" > 100
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 711.41 | Rich Kirsz |
| 663.41 | Farand Tremmil |
| 635.58 | Renado Sherrington |
| 622.31 | Lynda Roseborough |
| 611.59 | Mick Sever |
| 596.18 | Alysa Crombleholme |
| 583.81 | Emylee Hubbert |
| 574.20 | La verne Figg |
| 561.26 | Zora Hurles |
| 549.63 | Penny Bovingdon |
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SQLite does not support the CUBE operation directly. The CUBE operation, found in SQL databases like PostgreSQL, SQL Server, and Oracle, is used for generating a result set that represents a multi-dimensional cube for aggregation purposes.
It generates a result set that represents all possible combinations of grouping columns, providing comprehensive aggregate data for every combination.
When you use CUBE, SQL automatically performs aggregations for each combination of the specified dimensions (columns).
However, you can manually create similar results in SQLite using a combination of GROUP BY queries and UNION operations to simulate a CUBE.
SELECT
b.title AS "Book Title",
a.city AS "Order Destination City",
c.country_name AS "Order Destination Country",
ol.price AS "Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
;
| Book Title | Order Destination City | Order Destination Country | Price |
|---|---|---|---|
| The World’s First Love: Mary Mother of God | Rameshki | Russia | 10.23 |
| The Illuminati | Youwarou | Mali | 18.14 |
| The Illuminati | Pagaden | Indonesia | 16.78 |
| The Servant Leader | Trongsa | Bhutan | 7.42 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | Badong | Indonesia | 18.76 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | Krajan Sale | Indonesia | 1.58 |
| Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | Liwu | China | 10.03 |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | Gaoshan | China | 13.72 |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | Chengkou | China | 12.25 |
| Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | Ploso Wetan | Indonesia | 1.48 |
To replicate CUBE with the above query, the following steps are done in a query below:
Detailed Grouping - Group by Book Title, Order Destination City, and Order Destination Country.
Partial Groupings - Group by each pair of dimensions and each individual dimension. Book Title and Order Destination City, Book Title and Order Destination Country, Order Destination City and Order Destination Country, Book Title only, Order Destination City only, Order Destination Country only.
Grand Total: Aggregate without any grouping for the overall total.
-- Group by all three dimensions
SELECT
b.title AS "Book Title",
a.city AS "Order Destination City",
c.country_name AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
b.title, a.city, c.country_name
UNION ALL
-- Group by Book Title and Order Destination City
SELECT
b.title AS "Book Title",
a.city AS "Order Destination City",
NULL AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
GROUP BY
b.title, a.city
UNION ALL
-- Group by Book Title and Order Destination Country
SELECT
b.title AS "Book Title",
NULL AS "Order Destination City",
c.country_name AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
b.title, c.country_name
UNION ALL
-- Group by Order Destination City and Order Destination Country
SELECT
NULL AS "Book Title",
a.city AS "Order Destination City",
c.country_name AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
a.city, c.country_name
UNION ALL
-- Group by Book Title only
SELECT
b.title AS "Book Title",
NULL AS "Order Destination City",
NULL AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
b.title
UNION ALL
-- Group by Order Destination City only
SELECT
NULL AS "Book Title",
a.city AS "Order Destination City",
NULL AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
a.city
UNION ALL
-- Group by Order Destination Country only
SELECT
NULL AS "Book Title",
NULL AS "Order Destination City",
c.country_name AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id
GROUP BY
c.country_name
UNION ALL
-- Grand total
SELECT
NULL AS "Book Title",
NULL AS "Order Destination City",
NULL AS "Order Destination Country",
SUM(ol.price) AS "Total Price"
FROM
book b
INNER JOIN
order_line ol ON ol.book_id = b.book_id
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
INNER JOIN
country c ON a.country_id = c.country_id;
| Book Title | Order Destination City | Order Destination Country | Total Price |
|---|---|---|---|
| $30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie | Filimonovo | Russia | 7.93 |
| $30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie | Oehala | Indonesia | 16.83 |
| $30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie | Oguma | Nigeria | 17.68 |
| $30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie | San José Poaquil | Guatemala | 7.31 |
| 10 lb Penalty | Xinshancun | China | 7.22 |
| 100 Great Fantasy Short Short Stories | Rodotópi | Greece | 2.24 |
| 100 Great Fantasy Short Short Stories | Villa Bustos | Argentina | 5.08 |
| 100 Years of Lynchings | Taloqan | Afghanistan | 0.03 |
| 100 Years of Lynchings | Wąbrzeźno | Poland | 19.77 |
| 1000 Record Covers | Bulakbanjar | Indonesia | 10.83 |
If CUBE was available, much shorter code could be employed -
SELECT b.title AS “Book Title”, a.city AS “Order Destination City”, c.country_name AS “Order Destination Country”, SUM(ol.price) AS “Total Price” FROM book b INNER JOIN order_line ol ON ol.book_id = b.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id INNER JOIN address a ON co.dest_address_id = a.address_id INNER JOIN country c ON a.country_id = c.country_id GROUP BY CUBE(b.title, a.city, c.country_name);
Why use CUBE, what’s the point?
Using the CUBE operator in SQL is valuable for data analysis.
Comprehensive Aggregation:
CUBE provides a complete set of aggregations across all combinations of specified dimensions. This includes subtotals for every combination of the dimensions and a grand total.
Simplifies Query Writing:
Instead of writing multiple GROUP BY queries with UNION ALL, a single query with CUBE handles all required groupings and aggregations.
Facilitates Data Exploration:
By generating all possible subtotals, CUBE allows for easy exploration of data across different levels of granularity.
Supports OLAP Operations:
CUBE is particularly useful in Online Analytical Processing (OLAP) systems, enabling complex analytical queries on data warehouses.
Use Cases and Next Steps After Using CUBE
Data Reporting:
Use the result set from a CUBE operation to create comprehensive reports that show detailed and aggregated information. This is useful for business intelligence tools and dashboards.
Pivot Tables:
The result can be used to create pivot tables in tools like Excel or Tableau, allowing users to interactively explore data across different dimensions.
Trend Analysis:
Analyze trends and patterns by examining the subtotals and grand totals provided by the CUBE output.
Performance Metrics:
Calculate key performance indicators (KPIs) at various levels of detail, providing insights into different aspects of the business or operation.
Anomaly Detection:
Identify anomalies or outliers by comparing aggregated data at different levels of detail.